Runbook: Flux Reconciliation Failure

Alert

Prometheus Alert: FluxReconciliationFailure
Grafana Dashboard: Flux CD dashboard
Firing condition: A Flux Kustomization or HelmRelease has been in a failed or not-ready state for more than 15 minutes

Severity

Warning -- A reconciliation failure means changes committed to Git are not being applied, so the cluster state diverges from the Git repository. Extended failures may indicate a broken deployment or a dependency issue.

Impact

New deployments or configuration changes from Git are not applied
Platform components may be running stale configurations
If a core component fails (Istio, Kyverno, monitoring), downstream services may be affected
Security patches committed to Git are not being rolled out

Investigation Steps

Get the status of all Flux Kustomizations:

flux get kustomizations -A

Get the status of all HelmReleases:

flux get helmreleases -A
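
On a busy cluster the two listings above can be long. The following is a minimal sketch, not part of the original runbook, that prints only the resources whose Ready condition is not "True". The function name flux_not_ready is hypothetical; it reads the standard Ready condition through kubectl and does not use any Flux-specific flags.

```shell
# List Flux resources of one kind whose Ready condition is not "True".
# Usage: flux_not_ready <fully-qualified-kind>
flux_not_ready() {
  local kind="$1"
  # One tab-separated line per resource: namespace, name, Ready status.
  kubectl get "$kind" -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F'\t' 'NF && $3 != "True" {
      # A resource with no Ready condition yet is reported as Unknown.
      printf "%s/%s ready=%s\n", $1, $2, ($3 == "" ? "Unknown" : $3)
    }'
}
```

Example usage: flux_not_ready kustomizations.kustomize.toolkit.fluxcd.io, or flux_not_ready helmreleases.helm.toolkit.fluxcd.io.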

Identify the specific failing resource and check the controller logs for it:

flux logs --kind=Kustomization --name=<name> --namespace=flux-system
flux logs --kind=HelmRelease --name=<name> --namespace=<namespace>

Check the Flux source-controller for Git repository sync issues:

flux get sources git -A
kubectl logs -n flux-system deployment/source-controller --tail=100

Check the Flux helm-controller for Helm-specific errors:

kubectl logs -n flux-system deployment/helm-controller --tail=100

Check the Flux kustomize-controller:

kubectl logs -n flux-system deployment/kustomize-controller --tail=100
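
The three controller log checks above can be combined into one pass that shows only error-level lines. This is a sketch under one assumption: the controllers use JSON log encoding, so error lines contain "level":"error". The function name flux_controller_errors is hypothetical.

```shell
# Print only error-level log lines from the three Flux controllers.
# Usage: flux_controller_errors [tail-lines]   (default: 100)
flux_controller_errors() {
  local tail_lines="${1:-100}" controller
  for controller in source-controller kustomize-controller helm-controller; do
    echo "== ${controller} =="
    # If kubectl itself fails, its error goes to stderr and the fallback
    # message below is printed, so read both streams.
    kubectl logs -n flux-system "deployment/${controller}" --tail="${tail_lines}" |
      grep '"level":"error"' || echo "(no error-level lines)"
  done
}
```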

Verify Flux system pods are running:

kubectl get pods -n flux-system

Check for resource conflicts or validation errors:

kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Check if the HelmRelease has dependency issues:

kubectl get helmrelease <name> -n <namespace> -o yaml | grep -A 10 dependsOn

Resolution

HelmRelease stuck in "not ready" due to failed upgrade

Check the Helm history:

helm history <release-name> -n <namespace>

If the latest revision shows a failed status, trigger a retry (Flux re-fetches the source first):

flux reconcile helmrelease <name> -n <namespace> --with-source

If retries are exhausted, reset the release:

flux suspend helmrelease <name> -n <namespace>
helm rollback <release-name> <last-good-revision> -n <namespace>
flux resume helmrelease <name> -n <namespace>
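
If the rollback step fails partway through, the HelmRelease stays suspended and Flux stops managing it. The sketch below wraps the same three commands so that resume always runs. The function name and argument order are hypothetical; the commands are the ones listed above.

```shell
# Suspend, roll back, and resume a HelmRelease.
# Usage: rollback_helmrelease <helmrelease-name> <release-name> <namespace> <revision>
rollback_helmrelease() {
  local name="$1" release="$2" namespace="$3" revision="$4" status=0
  flux suspend helmrelease "$name" -n "$namespace" || return 1
  # Record a rollback failure, but continue so the release is resumed.
  helm rollback "$release" "$revision" -n "$namespace" || status=$?
  flux resume helmrelease "$name" -n "$namespace" || status=$?
  return "$status"
}
```

A non-zero return code means either the rollback or the resume failed; check helm history again before retrying.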

Kustomization failing due to invalid YAML

Check the error message in the Kustomization status:

kubectl get kustomization <name> -n flux-system -o yaml | grep -A 5 'message:'

Fix the YAML in the Git repository
Push the fix and force reconciliation:

flux reconcile source git sre-platform -n flux-system
flux reconcile kustomization <name> -n flux-system

Git source not syncing

Check the GitRepository status:

flux get sources git -A
kubectl describe gitrepository sre-platform -n flux-system

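Verify the Git credentials secret exists and contains the expected keys (the output includes base64-encoded credentials, so do not paste it into tickets or chat):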

kubectl get secret flux-system -n flux-system -o yaml

Test connectivity from the cluster to the Git host (replace the URL if the repository is not hosted on GitHub):

kubectl run -n flux-system --rm -it --restart=Never curl-test --image=curlimages/curl:8.4.0 -- curl -I https://github.com

Dependency failure cascading

If component B depends on component A, and A is failing:

Fix component A first
Then reconcile B:

flux reconcile helmrelease <component-a> -n <namespace-a>
# Wait for A to become ready
flux reconcile helmrelease <component-b> -n <namespace-b>
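
The "wait for A" step above can be made explicit with kubectl wait. This is a minimal sketch; the function name reconcile_in_order and the default timeout of 5m are assumptions, not values from this runbook.

```shell
# Reconcile component A, block until it reports Ready, then reconcile B.
# Usage: reconcile_in_order <a-name> <a-namespace> <b-name> <b-namespace> [timeout]
reconcile_in_order() {
  local a="$1" a_ns="$2" b="$3" b_ns="$4" timeout="${5:-5m}"
  flux reconcile helmrelease "$a" -n "$a_ns" || return 1
  # Stop here if A does not become Ready; reconciling B would fail anyway.
  kubectl wait "helmrelease/$a" -n "$a_ns" \
    --for=condition=Ready --timeout="$timeout" || return 1
  flux reconcile helmrelease "$b" -n "$b_ns"
}
```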

The dependency chain is: istio-base -> cert-manager -> kyverno -> monitoring -> logging -> openbao -> harbor -> neuvector -> keycloak -> tempo -> velero
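
To find where in the chain to start, the sketch below walks the components in order and prints the first one that is not Ready. It assumes each chain entry is also the metadata.name of a HelmRelease, which this runbook does not confirm; adjust the names if your releases are named differently. The function name first_failing_in_chain is hypothetical.

```shell
# Print the first component in the given chain whose HelmRelease is not Ready.
# Returns 0 if every component is Ready, 1 if one is not, 2 if kubectl fails.
# A component with no matching HelmRelease is treated as not Ready.
first_failing_in_chain() {
  local statuses component
  statuses="$(kubectl get helmreleases.helm.toolkit.fluxcd.io -A \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}')" || return 2
  for component in "$@"; do
    if ! printf '%s\n' "$statuses" |
        awk -F'\t' -v c="$component" '$1 == c && $2 == "True" { ok = 1 } END { exit !ok }'; then
      echo "$component"
      return 1
    fi
  done
  return 0
}
```

Example usage: first_failing_in_chain istio-base cert-manager kyverno monitoring logging openbao harbor neuvector keycloak tempo velero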

HelmRelease stuck with "another operation in progress"

Check for stale Helm secrets:

kubectl get secrets -n <namespace> -l owner=helm

If a pending install/upgrade secret exists, remove it:

kubectl delete secret sh.helm.release.v1.<name>.v<version> -n <namespace>

Trigger a new reconciliation:

flux reconcile helmrelease <name> -n <namespace>
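
Helm labels each release secret with its status, so the stale-secret check can be narrowed to pending states before anything is deleted. A minimal sketch; the function name pending_helm_secrets is hypothetical, and the label values are the pending statuses Helm writes to its release secrets.

```shell
# List Helm release secrets stuck in a pending state in one namespace.
# Usage: pending_helm_secrets <namespace>
pending_helm_secrets() {
  local namespace="$1"
  kubectl get secrets -n "$namespace" \
    -l 'owner=helm,status in (pending-install,pending-upgrade,pending-rollback)' \
    -o name
}
```

Review the output before deleting anything: a secret that shows as pending while an upgrade is still running is not stale.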

Prevention

Always run task lint before pushing changes to Git
Use flux diff kustomization to preview changes before committing
Pin exact chart versions in HelmReleases (never use * or ranges)
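Monitor the gotk_reconcile_condition metric in Prometheus for early drift detection (check that your Flux version still exports it; Flux v2.1 and later report resource status as gotk_resource_info through kube-state-metrics instead)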
Set up Grafana alerts on Flux reconciliation duration and failure count
Test HelmRelease changes in a dev environment before promoting to production

Escalation

If Flux system pods are crash-looping: escalate to platform team immediately
If Git source is unreachable for more than 30 minutes: check network/firewall rules and Git hosting service status
If multiple HelmReleases fail simultaneously: likely a shared dependency issue -- start from the root of the dependency chain
